feat(daemon): process supervision — llama-server lifecycle + status#69

Merged
OpenCircuitDev merged 4 commits into main from feat/process-supervision on Jun 11, 2026

@OpenCircuitDev

Owner

Summary

Track 1, item 2 from docs/AGENT_OPERATIONS.md — activate the dead-code supervisor module. The daemon can now spawn + supervise its own llama-server instead of requiring the user to hand-run it, with health-gated restart, exponential backoff, max-restart budget, and a Tauri status command for the UI to surface failure.

Ollama is intentionally not supervised here (it has its own service installer + lifecycle); the spawn-gate refuses when backend = "ollama", and the module doc explains why.

What changed

Rust (crates/ocm-daemon/)

  • settings — one new field: llama_server_binary: Option<String>. #[serde(default)] for forward-compat. Doc explicitly notes "no-op when backend = ollama".
  • supervisor — was already a partial module (Supervisor struct, spawn helpers, wait_for_http_ready); now activated. Added:
    • SupervisorStatus enum (Serialize, snake_case tag): NotSpawning / Starting / Running { pid } / Restarting { attempt, last_error } / FailedAfterMaxRestarts { attempts, last_error } / Stopped.
    • SupervisorPolicy with defaults: max_restarts=3, initial_backoff=500ms, max_backoff=10s, stability_window=60s, health_check_interval=5s, health_check_timeout=15s.
    • compute_backoff(attempt, initial, max) — exponential, u8::MAX-safe (no shift overflow).
    • supervise(supervisor, policy, status, shutdown) — health-gated restart loop. Every wait is tokio::select!-raced against the shutdown signal. Stability-window reset means a process that ran healthy then crashed isn't penalized as a flap. tracing::error when the budget is exhausted.
    • Module-level doc clarifies Ollama no-spawn (own service installer; we'd fight ollama-svc's restart logic).
    • spawn_vllm_server kept (already tested) but explicitly #[allow(dead_code)] — NVIDIA supervision is a separate follow-up.
  • bootstrap — wire-points:
    • should_spawn_llama_supervisor(settings, models_dir) -> bool — the spawn-gate decision: requires backend = LlamaCpp AND llama_server_binary.is_some() AND the model GGUF exists at models_dir/<model_id>.gguf (matching ocm_models::downloader convention). Six test cases cover the matrix.
    • build_llama_supervisor(settings, models_dir) -> Option<(Arc<Supervisor>, SupervisorPolicy)> — resolves binary + model path + port (from inference_base_url) + health URL (/v1/models), returns None when the gate refuses.
    • LlamaSupervisorState — what main.rs app.manage()s. Holds the shared status Arc<Mutex<_>> plus the shutdown watch::Sender (in Mutex<Option<_>> so a future RunEvent::ExitRequested hook can .take() it). Dropping the state is sufficient for clean shutdown in v0.1.2; signal_shutdown stub is included for the future hook.
  • commands: get_supervisor_status Tauri command, returns the live SupervisorStatus.
  • main — supervisor wired into setup(): build → spawn supervise loop on tauri runtime → app.manage(supervisor_state). Falls cleanly back to NotSpawning when the gate refuses. Removed the #[allow(dead_code)] mod supervisor; annotation (it's live now).
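
The backoff and stability-window behavior described above can be sketched as follows. This is a minimal illustration, not the code in crates/ocm-daemon: the function bodies, the 0-based attempt numbering, and the helper name attempts_after_exit are assumptions made for the example. Only the compute_backoff(attempt, initial, max) shape and the default values come from the description.

```rust
use std::time::Duration;

/// Sketch (assumed body): delay before restart number `attempt`, 0-based.
/// Doubles `initial` on each attempt and clamps the result at `max`.
fn compute_backoff(attempt: u8, initial: Duration, max: Duration) -> Duration {
    // 2^attempt without shift overflow: checked_shl returns None once the
    // shift amount reaches 32, so large attempts (up to u8::MAX) saturate.
    let factor = 1u32.checked_shl(u32::from(attempt)).unwrap_or(u32::MAX);
    // checked_mul guards the multiplication; overflow is treated as "max".
    initial.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Hypothetical helper (not named in the PR): the attempt count to carry
/// forward after the child exits. A child that stayed up for at least
/// `stability_window` starts from a clean budget, so this crash counts as
/// attempt 1 instead of extending an earlier flap.
fn attempts_after_exit(attempts: u32, uptime: Duration, stability_window: Duration) -> u32 {
    let base = if uptime >= stability_window { 0 } else { attempts };
    base + 1
}
```

Under these assumptions and the default policy (initial_backoff=500ms, max_backoff=10s), attempts 0 through 5 wait 500 ms, 1 s, 2 s, 4 s, 8 s, and then the clamped 10 s.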

Frontend: Settings interface gains llama_server_binary; settings page gains a path input with placeholder communicating None=do-not-spawn.

TDD audit trail

| Commit | CI Run | Verdict |
|---|---|---|
| 115286e test: RED | 27379971603 | ❌ expected (compile-fail on missing symbols) |
| 35849fc feat: GREEN | 27380885416 | ❌ rustfmt drift on 3 multi-line decisions |
| 84e746b fix: rustfmt + frontend | 27381591386 | ❌ clippy: unused_assignments, dead_code on intentional-future-use code |
| 44253c4 fix(clippy) | 27381788960 | ✅ Rust ubuntu/macOS/windows, all tests pass |
| (latest, frontend) | 27381591372 | ✅ Frontend CI |

New test coverage (+17 tests)

  • supervisor.rs (+5): SupervisorStatus::default(); SupervisorPolicy::default() lockstep with constants; compute_backoff schedule (incl. u8::MAX safety); supervise() integration #1: immediate-exit Command + unresponsive health URL → exactly max_restarts=2 attempts → FailedAfterMaxRestarts; supervise() integration #2: long-running sleep + early shutdown signal → Stopped + child reaped.
  • settings.rs (+3): default llama_server_binary == None; TOML round-trip; legacy file (no field) still parses.
  • bootstrap.rs (+9): full spawn-gate decision matrix (6 cases — yes / Ollama-no / Auto-no / binary-None-no / file-missing-no / model_id-None-no), parse_port helper (2 cases), build_llama_supervisor returns None when gate refuses.
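
For readers curious what a url-crate-free parse_port might look like, here is one possible sketch. The body below is an assumption written for illustration (the PR only names the helper and says it reads the port from inference_base_url); the example URLs are hypothetical as well.

```rust
/// Sketch (assumed body): extract an explicit port from a base URL such as
/// "http://127.0.0.1:8080/v1". Returns None when no port is written out.
fn parse_port(base_url: &str) -> Option<u16> {
    // Drop the scheme ("http://") when present.
    let rest = base_url.split_once("://").map_or(base_url, |(_, r)| r);
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest.split(|c| matches!(c, '/' | '?' | '#')).next()?;
    // Drop any "user:pass@" prefix.
    let host_port = authority.rsplit_once('@').map_or(authority, |(_, hp)| hp);
    // The port follows the last ':'.
    let (host, port) = host_port.rsplit_once(':')?;
    // For a bracketed IPv6 host, a ':' inside the brackets is not a port separator.
    if host.contains('[') && !host.ends_with(']') {
        return None;
    }
    // Non-numeric or out-of-range ports fail the u16 parse and yield None.
    port.parse().ok()
}
```

This sketch does not fall back to a scheme default (80 or 443) when the port is omitted; whether the real helper does is not stated in the PR.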

ocm-daemon test count: 25 → 42 (verified in run 27381788960 log).

Design choices made

Per AGENT_OPERATIONS "NEEDS_APPROVAL when not covered by spec": the operator declined the multi-choice up front, so I took my own recommended paths and documented them:

  • Status surface: new Tauri command get_supervisor_status + tracing::error on FailedAfterMaxRestarts. No tray-icon hint, no UI panel in this PR (UI polish = Track 1 item 3).
  • ctx_len: hardcoded DEFAULT_LLAMA_CTX_LEN = 4096 const (matches implementation-plan example). No new Settings field; revisit in item 3 if needed.
  • Spawn-gate conservatism: if model_id is set but the GGUF doesn't exist on disk, refuse to spawn rather than burn the restart budget on a server with nothing to load. Chat fails loudly via the existing "backend not reachable" message instead.
  • Auto backend → no spawn: explicitly opt-in via backend = "llamacpp". Preserves pre-v0.1.2 behavior for users who never opted into supervision.
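
The spawn-gate choices above amount to a short conjunction of checks. The sketch below illustrates that decision under stated assumptions: the Settings struct and Backend enum here are simplified stand-ins defined for the example (the real types in ocm-daemon have more fields), and only the three conditions and the models_dir/<model_id>.gguf convention come from the PR.

```rust
use std::path::Path;

// Simplified stand-ins for the daemon's real types (assumed shapes).
#[derive(Clone, Copy, PartialEq, Eq)]
enum Backend {
    Auto,
    Ollama,
    LlamaCpp,
}

struct Settings {
    backend: Backend,
    llama_server_binary: Option<String>,
    model_id: Option<String>,
}

/// Sketch of the gate: spawn only when the user opted into LlamaCpp,
/// configured a binary, and the model file is already on disk.
fn should_spawn_llama_supervisor(settings: &Settings, models_dir: &Path) -> bool {
    // Auto and Ollama never spawn: supervision is explicit opt-in.
    if settings.backend != Backend::LlamaCpp {
        return false;
    }
    // No binary configured means "do not spawn".
    if settings.llama_server_binary.is_none() {
        return false;
    }
    // Refuse when the GGUF is missing, so the restart budget is not spent
    // on a server that has nothing to load.
    match &settings.model_id {
        Some(id) => models_dir.join(format!("{id}.gguf")).is_file(),
        None => false,
    }
}
```

Keeping the gate a pure function of settings plus a directory makes the six-case matrix cheap to test against a temporary directory.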

Test plan

  • cargo clippy --workspace --all-targets -- -D warnings — green on ubuntu/macOS/windows (run 27381788960)
  • cargo test --workspace — 42 ocm-daemon tests + others green on all 3 platforms (run 27381788960)
  • Frontend npm run check — 258 files, 0 errors (local)
  • Frontend Frontend CI workflow — green on Node 20 + Node 22 (run 27381591372)
  • Manual smoke (operator) — point llama_server_binary at a real binary, download a registry model, launch cargo tauri dev, observe llama-server spawned + supervised. Not run here (no local Rust toolchain).
  • Manual failure-mode smoke (operator) — point at a binary that exits immediately; observe FailedAfterMaxRestarts after 3 attempts via get_supervisor_status (frontend hook is follow-up).

Out of scope (deferred)

  • vLLM supervision: spawn_vllm_server helper still exists, still tested, but explicitly not wired (heavier preconditions; separate follow-up).
  • RunEvent::ExitRequested hook — current clean-shutdown is Drop-of-watch::Sender (sufficient; the loop notices, calls Supervisor::stop(), sets Stopped). An explicit hook + join-handle wait would be more graceful; signal_shutdown stub is ready for it.
  • UI status panel — Tauri command exists, no frontend display yet. Track 1 item 3 territory.
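
The drop-as-shutdown idea can be illustrated without tokio. The sketch below is an analog, not the PR's implementation: it swaps tokio's watch channel and tokio::select! for std::sync::mpsc and a thread, and the Status enum and spawn_loop name are hypothetical. What it shares with the described design is the mechanism: the loop waits on a channel at every tick, and dropping the sender is enough to end it.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical two-state stand-in for the richer SupervisorStatus.
#[derive(Debug, Clone, PartialEq)]
enum Status {
    Running,
    Stopped,
}

/// Starts a supervise-like loop on a thread. Returns the sender, which acts
/// as the shutdown handle, and the join handle for the loop.
fn spawn_loop(
    status: Arc<Mutex<Status>>,
    tick: Duration,
) -> (mpsc::Sender<()>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || {
        *status.lock().unwrap() = Status::Running;
        loop {
            match rx.recv_timeout(tick) {
                // No signal within one tick: a real loop would health-check here.
                Err(RecvTimeoutError::Timeout) => continue,
                // Explicit signal, or every sender dropped: leave the loop.
                Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
            }
        }
        // A real supervisor would kill and reap the child before this point.
        *status.lock().unwrap() = Status::Stopped;
    });
    (tx, handle)
}
```

Dropping the returned sender and joining the handle leaves the status at Stopped. Joining is the part an explicit exit hook would add, since a plain drop does not wait for the loop to finish.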

🤖 Generated with Claude Code

Brand and others added 4 commits June 11, 2026 15:55
…wn-gate

TDD red-pass for Task 2 (Track 1 item 2). Tests reference symbols that
don't yet exist; CI compile-fail IS the red. The green commit follows.

supervisor.rs:
- SupervisorStatus default = NotSpawning
- SupervisorPolicy default uses documented constants
- compute_backoff doubles then clamps at max (u8::MAX-safe)
- INTEGRATION: supervise() with immediate-exit Command + unresponsive
  health URL surfaces FailedAfterMaxRestarts after exactly max_restarts
- INTEGRATION: supervise() returns cleanly with Stopped on shutdown signal

settings.rs:
- Settings::default().llama_server_binary == None
- TOML round-trip for the new field
- Legacy settings.toml without the field still parses (None)

bootstrap.rs (the spawn-gate decision matrix):
- LlamaCpp + binary set + model file present → SPAWN
- backend = Ollama → DO NOT SPAWN (per directive: Ollama supervises itself)
- backend = Auto → DO NOT SPAWN (preserve pre-v0.1.2 behavior; opt-in only)
- llama_server_binary = None → DO NOT SPAWN (directive's preserve-current clause)
- model file missing on disk → DO NOT SPAWN (conservative; chat fails with
  the existing 'backend not reachable' message rather than burn the budget)
- model_id unset → DO NOT SPAWN

Design choices made (operator declined the multi-choice; recommended paths taken):
- Status surface = new Tauri command + tracing::error! on Failed (not a
  tray-icon hint; UI panel is Track 1 item 3)
- ctx_len = hardcoded 4096 const (no new setting field this PR)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uri status

Minimal impl to satisfy the RED tests. Settings opt-in, Ollama no-spawn,
clean shutdown via tokio watch channel.

supervisor.rs:
- SupervisorStatus enum (NotSpawning/Starting/Running/Restarting/
  FailedAfterMaxRestarts/Stopped) — Serialize for Tauri command return
- SupervisorPolicy: max_restarts=3, initial_backoff=500ms, max_backoff=10s,
  stability_window=60s; per-test override OK
- compute_backoff(attempt, initial, max): exponential, u8::MAX-safe
- supervise(): start → wait_for_http_ready → monitor_until_dead loop with
  shutdown-aware tokio::select on every wait. tracing::error on
  FailedAfterMaxRestarts. Stability-window reset for "stable, then crashed"
  process.
- Module-level doc explicitly states Ollama is NOT supervised here and why
  (own service installer, would fight ollama-svc's restart logic).

settings.rs:
- llama_server_binary: Option<String>, #[serde(default)] — forward-compat

bootstrap.rs:
- should_spawn_llama_supervisor(settings, models_dir): the decision matrix.
  All five "no" branches covered by RED tests.
- build_llama_supervisor(settings, models_dir) -> Option<(Arc<Sup>, Policy)>:
  resolves binary + model path + health URL + port; returns None when
  spawn-gate refuses (so main.rs branches cleanly).
- parse_port helper (tiny — avoids pulling in `url` crate).
- LlamaSupervisorState: holds Arc<Mutex<SupervisorStatus>> + the shutdown
  watch::Sender (in Mutex<Option<_>> so an exit hook can .take() it later).

commands.rs:
- get_supervisor_status Tauri command — returns the live status enum
  (Serialize-flat, snake_case tag).

main.rs:
- Stop allowing-dead-code on supervisor (it's live now).
- Setup branch: build supervisor → spawn supervise loop on tauri runtime →
  manage state. NotSpawning fallback when spawn-gate refuses.
- Register get_supervisor_status in invoke_handler.
- Clean shutdown: dropping LlamaSupervisorState drops the watch::Sender,
  which signals the supervise loop, which calls Supervisor::stop() (kills
  child) and sets Stopped. Supervisor::Drop is the belt; watch is suspenders.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Format drifts caught by CI (no local rustfmt on this machine):
- main.rs: assignment line-break before match
- supervisor.rs: info!() multi-line; compute_backoff multi-arg break; assert_eq! split

Frontend:
- settings.ts: llama_server_binary field on Settings interface
- settings/+page.svelte: input row, placeholder communicates None=do-not-spawn
- npm run check clean (258 files, 0 errors)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three clippy gripes on the supervisor:

1. supervise() loop: `last_error` init value provably unread on all real
   paths but is sound as a sentinel — annotate #[allow(unused_assignments)]
   rather than restructure with Option (cheaper for a guaranteed-overwritten
   variable).

2. spawn_vllm_server: kept for the future NVIDIA-supervision path (already
   tested), not wired into bootstrap this PR. #[allow(dead_code)] with
   a doc comment explaining the deferred path.

3. LlamaSupervisorState.shutdown: observed via Drop semantics (dropping the
   watch::Sender wakes the supervise loop), not direct reads. Annotated and
   accompanied by a stub signal_shutdown method for the future ExitRequested
   hook. dead_code allow scoped to the field + method.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@OpenCircuitDev OpenCircuitDev merged commit 757612a into main Jun 11, 2026
8 checks passed
OpenCircuitDev pushed a commit that referenced this pull request Jun 11, 2026
#56 verdict

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>